Keyword and business tag extraction

ABSTRACT

A system to extract relevant keywords or business tags that describe a company's business is provided. The keyword extraction system utilizes a smart crawler to identify and crawl product pages from a company's website. These pages serve to provide textual descriptions of product offerings, solutions, or services that make up the company's business. The keyword extraction system combines these web documents with other textual descriptions of companies, e.g. from third party data vendors or other public data sources and company databases, to form a corpus of documents that describe companies. The corpus of documents and keywords are processed to segment the plurality of companies into subsets by applying a clustering technique and to provide visualization of the clusters with business tags.

CROSS-REFERENCE TO RELATED APPLICATIONS

This patent application is a continuation of U.S. application Ser. No. 15/689,942, filed Aug. 29, 2017, which claims priority from U.S. Provisional Application No. 62/380,908 filed on Aug. 29, 2016, which are incorporated by reference herein.

FIELD

Implementations disclosed herein relate, in general, to information management technology and specifically to semantic analytics technology.

BACKGROUND

Marketing strategies commonly involve dividing a broad market of prospects into subsets or segments of prospects that have characteristics in common, in the hope that they will have common needs, interests, or priorities. In the case that prospects are individual human consumers, such characteristics can include, but are not limited to, demographic information about the age, sex, race, religion, occupation, income, or education level, geographic information about the prospect's location within regions, countries, states, cities, neighborhoods, or other locales, and behavioral and psychographic information about the lifestyle, attitude towards and response to certain products or other stimuli. In the case that prospects are companies, e.g. in business-to-business (B2B) marketing, such characteristics commonly include firmographic information, such as the company size, revenue, industry, and location. Marketers can apply strategies that are specialized for each segment, e.g. by creating messaging content or advertisements that resonate with, or are more relevant to the target prospect, which lead to much better conversion rates.

In the same vein, sales development teams and account executives achieve better outcomes if they research the prospect's background or characteristics and personalize their outreach efforts. As an example, in B2B situations, providing a case study or success story about a current customer similar to the prospect company is a powerful strategy to convince the prospect to purchase a product or service because it provides evidence of previous success and reduces the perceived risk by the prospect. The ability to semantically describe, group, and identify similar companies can be viewed as a form of business micro-segmentation that is much more specific than segmenting using broad industry labels to describe prospect companies, and is in turn more powerful and actionable.

SUMMARY

A system to extract relevant keywords or business tags that describe a company's business is provided. The keyword extraction system utilizes a smart crawler to identify and crawl product pages from a company's website. These pages serve to provide textual descriptions of product offerings, solutions, or services that make up the company's business. The keyword extraction system combines these web documents with other textual descriptions of companies, e.g. from third party data vendors or other public data sources and company databases, to form a corpus of documents that describe companies. The corpus of documents and keywords are processed to segment the plurality of companies into subsets by applying a clustering technique and to provide visualization of the clusters with business tags.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Other features, details, utilities, and advantages of the claimed subject matter will be apparent from the following more particular written Detailed Description of various embodiments and implementations as further illustrated in the accompanying drawings and defined in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

A further understanding of the nature and advantages of the present technology may be realized by reference to the figures, which are described in the remaining portion of the specification. In the figures, like reference numerals are used throughout several figures to refer to similar components. In some instances, a reference numeral may have an associated sub-label consisting of a lower-case letter to denote one of multiple similar components. When reference is made to a reference numeral without specification of a sub-label, the reference is intended to refer to all such multiple similar components.

FIG. 1 discloses example operations of the predictive analytics technology disclosed herein.

FIG. 2 discloses alternate example operations of the predictive analytics technology disclosed herein.

FIG. 3 discloses an example block diagram illustrating how topics that exist within a document are used to boost the relevance score of individual keywords.

FIG. 4 illustrates an alternate example block diagram illustrating how topics that exist within a document are used to boost the relevance score of individual keywords that are related to that topic according to an implementation disclosed herein.

FIG. 5 illustrates an example accomplishment of a non-overlap algorithm disclosed herein.

FIG. 6 illustrates example clusters of companies related to each other.

FIG. 7 illustrates an example view of the clusters with various business tags.

FIG. 8 illustrates an example system that may be useful in implementing the described predictive analytics technology disclosed herein.

FIG. 9 illustrates example diagrams describing determining similarity of keywords.

FIG. 10 illustrates example operations for determining similarity of keywords.

DESCRIPTION

Disclosed herein is an automated system and method to extract relevant keywords (i.e. business tags) that describe a company's business.

FIG. 1 illustrates a series of operations used to extract business tags that describe a company's business. At an operation 102, the system disclosed herein uses a smart crawler to identify and crawl web pages from a number of companies' websites. For example, such companies may be all the companies that may use products or services from a client for which the system disclosed herein is performing keyword and business tag extraction. However, in an alternative implementation, the operation 102 crawls the websites of companies globally. Alternatively, the operation 102 only crawls product web pages of the companies' websites. Yet alternatively, the operation 102 may crawl websites of only selected target companies.

These pages of the companies' websites serve to provide textual descriptions of product offerings, solutions, or services that make up the companies' business. For example, a web page of a target company that is in the business of selling footwear may provide information about what kind of footwear the target company is selling, the price point of the footwear, target market for the footwear, etc. The operation 102 identifies a number of keywords related to various companies. The operation 102 performs smart crawling in that it determines which pages are appropriate for crawling, which keywords are appropriate, etc. For example, the operation 102 may determine that it is important to crawl a product page but it is not necessary to crawl a terms and conditions page. Similarly, the operation 102 may determine that it does not need to extract words such as “the,” “best,” etc., as they do not necessarily describe products and services of the company.

In one implementation, the operation 102 outputs a list of keywords extracted from the web pages for a company and the frequency of each of such keywords. For example, for a company selling footwear, the keywords may be “shoes,” “sandals,” “running,” etc. The frequency at which each of these keywords is extracted from the web pages may also be tabulated. In one implementation, the operation 102 may output a matrix of a large number of companies and keywords for each of these companies.

Subsequently, an operation 104 combines these web documents with other textual descriptions of companies, e.g. from third-party data vendors or other public data sources and company databases, to form a corpus of documents that describe companies. Thus, for example, the operation 104 may extract keywords from other sources, such as a news article, a LinkedIn™ page, a Wikipedia™ page about the company, a consumer product review website, AdWords purchased by the company, etc. Thus, in the example of a company selling footwear, the operation 104 may combine textual descriptions from such other sources (also referred to as the secondary sources). The output of the operation 104 is used to enhance the matrix generated at the operation 102.

Subsequently, an operation 106 extracts keyword phrases from the text descriptions and counts keyword phrases that appear for each company, forming a vector of term frequencies to represent each company, where a term is an n-gram, a chain of n words. Specifically, the operation 106 generates a list of candidate descriptive phrases that may provide a description of a company. For example, for the company selling footwear, one such phrase may be “running shoe”. Another such phrase may be “low-impact shoe”, etc. The operation 106 extracts such keyword phrases for the company and documents the frequency of each of these keyword phrases. In one implementation, the candidate descriptive phrases are generated by aggregating the keywords, meta keywords, and meta descriptors from the company web pages and from the secondary sources. The operation 106 aggregates these descriptive phrases for the company and generates the count of those phrases.

The descriptive phrases are also referred to as n-grams. For example, for a company selling footwear, a unigram may be “shoes”, a bi-gram may be “running shoe”, a tri-gram may be “altitude running shoe”, etc. The operation 106 generates the count for each such n-gram related to the company. In one implementation, the operation 106 generates the n-grams across the websites of the companies globally to determine the n-grams that are used more often to describe a company or a product. Each of the n-grams in this collection of n-grams is related to a count of how often the n-gram occurs. In one implementation, an n-gram having a higher count is ranked higher.
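The n-gram counting performed by the operation 106 can be sketched as follows. This is a minimal illustration only: the stop-word list, the tokenizer, and the function names `tokenize` and `extract_ngrams` are assumptions introduced here and are not part of the disclosed system.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a production system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "best", "our", "is", "in", "to"}

def tokenize(sentence):
    """Lower-case a sentence and split it into (optionally hyphenated) word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())

def extract_ngrams(text, max_n=3):
    """Count n-grams (n = 1..max_n) that neither start nor end with a stop word.

    N-grams are formed within a sentence only, so phrases never span a period.
    """
    counts = Counter()
    for sentence in re.split(r"[.!?;]", text):
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts

counts = extract_ngrams(
    "The best running shoe. Our altitude running shoe is a low-impact shoe."
)
```

Here `counts` is the per-company term-frequency vector: the bi-gram “running shoe” is counted twice and the tri-gram “altitude running shoe” once, while words such as “the” and “best” never appear as terms.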

An operation 108 computes document frequencies (DF) for each considered phrase or n-gram across the entire corpus, defined as the number of companies whose text descriptions contained that phrase. In one implementation, to compute the DF for each n-gram, a website is considered one document.

Alternatively, each web page may be considered a document. Thus, if the n-gram “altitude running shoe” shows up on 300 web pages, including company pages, news sources, Wikipedia pages, etc., the n-gram “altitude running shoe” is given a document frequency of 300. In one implementation, the count may be evaluated as a percentage of the total documents in the universe. For example, if the system is evaluating a million documents, the document frequency of 300 may indicate the phrase “altitude running shoe” to be important and descriptive, while a word like “the” is deemed unimportant because it appears in nearly all of the million documents. In yet another alternative implementation, each occurrence of an n-gram is given a weight based on the document that the n-gram is from. For example, an n-gram appearing on a Wikipedia document may be given a higher weight compared to an n-gram appearing on a social network document.

To reduce the contribution of phrases that are very common, an operation 110 applies a term-frequency-inverse-document-frequency (TF-IDF) transformation. Here, the term-frequency (TF) emphasizes phrases that appear multiple times within the document, while the inverse document frequency (IDF) de-emphasizes phrases that are common across documents, and emphasizes phrases that are rarer, more descriptive, or salient. For example, if an n-gram “running shoe” appears 10 times in a document, it has the TF of 10 for the document. On the other hand, if the n-gram “running shoe” is common across all documents, it may be a very common n-gram and its inverse frequency across all documents (IDF) de-emphasizes the importance of that n-gram. Thus, the TF is a frequency per document and the IDF is inversely proportional to the frequency across the entire corpus of documents. The TF may be generated based on output from the operation 106, whereas the IDF may be calculated based on the output of the operation 108.

The term-frequency function is a function that increases with the number of occurrences of an n-gram phrase in a document. An example is simply TF=term_count, while in a sublinear scaling example, TF=1+log(term_count). The inverse-document-frequency function is a function that decreases with the number of documents that contain the n-gram phrase. An example formulation is IDF=log(num_total_documents/num_documents_with_term). The TF-IDF is the multiplicative product of TF and IDF.
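The formulas above can be illustrated with a short sketch. It assumes each document is already represented as a mapping from n-gram to term count; the function names `document_frequencies` and `tf_idf`, and the three toy documents, are hypothetical.

```python
import math

def document_frequencies(docs):
    """Return {ngram: number of documents that contain it}.

    docs maps a document id to a {ngram: term_count} mapping.
    """
    df = {}
    for counts in docs.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return df

def tf_idf(docs, sublinear=True):
    """Compute TF-IDF per document using the example formulas in the text.

    TF  = 1 + log(term_count) when sublinear, else term_count
    IDF = log(num_total_documents / num_documents_with_term)
    """
    n_docs = len(docs)
    df = document_frequencies(docs)
    weighted = {}
    for doc_id, counts in docs.items():
        vec = {}
        for term, count in counts.items():
            tf = 1.0 + math.log(count) if sublinear else float(count)
            idf = math.log(n_docs / df[term])
            vec[term] = tf * idf
        weighted[doc_id] = vec
    return weighted

# Toy corpus (illustrative data only).
docs = {
    "shoe_co": {"running shoe": 10, "company": 3},
    "boot_co": {"boots": 4, "company": 2},
    "game_co": {"mobile games": 5, "company": 1},
}
weights = tf_idf(docs)
```

Because “company” appears in every document, its IDF is log(3/3) = 0 and its TF-IDF value vanishes, whereas “running shoe” keeps a positive weight.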

While the TF-IDF transformation is good at scaling individual terms independently based on occurrences within a document and occurrences across the corpus, it does not always work well for keyword ranking; the terms with the highest TF-IDF values are often not the terms that a human would consider to be most relevant descriptors of the company. The underlying problem is that TF-IDF does not take into consideration the co-occurrence of different keyword phrases within each document. The patterns of co-occurring words and phrases can be interpreted as “topics” within a document, and each company or document can be expected to focus on a few topics or themes. A human typically identifies relevant keywords by considering both the saliency of the keyword itself, and whether the keyword is “on topic” within the context of the document. For each document, the operation 110 outputs a list of key n-grams and a TF-IDF value for each key n-gram.

An operation 112 determines similarity of keywords. Specifically, the operation 112 determines how similar any two keywords are to each other. The method for determining similarity of keywords is further disclosed below with respect to FIGS. 9 and 10.

An operation 114 applies a relevance transform by boosting the TF-IDF value of phrases within each document based on how on-topic each phrase is. One of the inputs for the operation 114 is the keyword similarity value generated at operation 112. A given document can be represented by n-grams and their corresponding strengths. Considering the co-occurrence of n-grams within the document also allows extracting a set of topics, their strengths, and their associated influences to-and-from the individual n-grams. A generalized diagram of the n-gram and topic relationships is shown below in FIGS. 3 and 4. In an example implementation, the relevance scores for each n-gram can be calculated as the n-gram strength times the weighted sum of the associated topic strengths, i.e.

$r_{i} = w_{i} \cdot \sum_{j = 1}^{k} \left( e_{ji} \cdot t_{j} \right)$

where r_(i) is the relevance of n-gram i, w_(i) is the strength of n-gram i, e_(ji) is the influence or edge weight from topic j to n-gram i, t_(j) is the strength of topic j, and k is the number of topics.

In one implementation, the topics can be selected to be the individual stemmed words that make up the n-grams. Stemming refers to the reduction of words to their word stem, base, or root form. For example, a bigram “mobile gaming” can be viewed as exhibiting two topics “mobile” and “game”, the stemmed forms of “mobile” and “gaming”. If there exist many other unique phrases that are comprised of words that stem to “mobile” and “game”, such as “mobile applications” or “gaming equipment”, then it would increase the topical strength of “mobile” and “game” within this document, and every phrase linked to these topics would get boosted in terms of relevance. One example function for assigning the topic strength is 1+log(degree) where degree is the number of edges from that topic to its associated n-grams within the document, or in other words, the number of unique n-grams that contain a word that stems to that topic. In this case, the edge weights can simply be 1.0 when there is an association between an n-gram and a topic, and 0.0 (no edge) when the n-gram does not contain a word that stems to the topic. A hypothetical example of this implementation, for a document about mobile gaming and game development 400 is shown below in FIG. 4.
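The stem-based relevance boost described above can be sketched as follows. The lookup table `TOY_STEMS` is a hypothetical stand-in for a real stemmer, and the TF-IDF values in `doc` are made-up illustration data; neither comes from the disclosure.

```python
import math

# Hypothetical stand-in for a real stemming algorithm: maps inflected words
# to a stem and leaves every other word unchanged.
TOY_STEMS = {"gaming": "game", "games": "game", "applications": "application"}

def stem(word):
    return TOY_STEMS.get(word, word)

def relevance_scores(tfidf):
    """Boost each n-gram's TF-IDF strength w_i by the strengths of its topics.

    Topics are the stems of the words in each n-gram, every edge weight is 1.0,
    and a topic's strength is 1 + log(degree), where degree is the number of
    unique n-grams in the document linked to that topic.
    """
    topics_of = {gram: {stem(w) for w in gram.split()} for gram in tfidf}
    degree = {}
    for topics in topics_of.values():
        for topic in topics:
            degree[topic] = degree.get(topic, 0) + 1
    strength = {topic: 1.0 + math.log(d) for topic, d in degree.items()}
    # r_i = w_i * sum_j (e_ji * t_j), with e_ji = 1.0 on existing edges.
    return {
        gram: w * sum(strength[t] for t in topics_of[gram])
        for gram, w in tfidf.items()
    }

# Illustrative TF-IDF values for one document.
doc = {"mobile gaming": 2.0, "mobile games": 1.5, "game development": 1.0, "cloud": 3.0}
scores = relevance_scores(doc)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy document “cloud” has the highest raw TF-IDF value but is linked to no shared topic, so after boosting it ranks below all three n-grams that share the “game” topic.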

Similar Keywords and Phrases:

In FIG. 4, t_(i) is the strength of topic i, w_(i) is the n-gram TF-IDF value, and r_(i) is the output relevance score. For example, the topic “game” is shared between three n-grams (“game development”, “mobile gaming”, and “mobile games”). While in FIG. 4 the edge weights 402 between the topics 404 and the n-grams 408 are 1.0, in another implementation, different edge weights 402 may be used. For example, an alternative algorithm may determine that in the n-gram “game development”, the phrase “game” only contributes 40% and, therefore, the edge 402a may be given a weight of 0.40.

Furthermore, similarities between n-grams may also be used by a computer to determine the relevance scores. For example, the n-grams “mobile games” and “mobile gaming” may be determined to be similar, in which case the co-occurrence of these two similar n-grams within one document can be used to boost the TF-IDF value of each of these two n-grams.

In other implementations, the topics, topic strengths, and n-gram-topic edge weights for each document can be extracted using techniques such as Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Dirichlet Processes, Non-negative Matrix Factorization, and others, or a combination of methods. Similar to before, the topical strength can also be used to amplify the associated individual n-gram strengths to form a measure of relevance for each n-gram.

The top-ranking keyword phrases by relevance score can be used as business tags that succinctly describe a company's business or products. The dataset supports lookups by company to find the company's descriptive tags (as shown below in FIG. 2 by operation 222), and reverse lookups by business tag to retrieve all companies that specialize in that tag (as shown below in FIG. 2 by operation 220). Thus, FIG. 2 illustrates operations for providing keyword-to-company relations.

Keyword Relevance/Keyword to Company Search

In one implementation, an operation 116 generates relevance scores for various companies and keywords/phrases. For example, the operation 116 may produce, for each company, a ranked and scored list of keywords. Thus, for a particular footwear company the keyword “boots” may be ranked higher than the term “sandal”, in which case that particular company may be more likely to sell, specialize in, or be known for boots than sandals. In one implementation, the operation 116 may determine such ranking based on the TF-IDF for the terms in the documents related to the company. For example, if the keyword “boots” appears in more documents for the particular footwear company compared to the keyword “sandals”, “boots” is ranked higher than “sandals” for that particular footwear company.

Similarly, the operation 116 may also produce, for each keyword, a ranked and scored list of companies. Thus, for example, for the keyword “boot” a First Footwear Company may be ranked higher than a Second Footwear Company, which may signify that the First Footwear Company is more likely to sell, specialize in, or be known for boots than the Second Footwear Company. In one implementation, the operation 116 may determine such ranking based on the TF-IDF for the term in the documents related to the companies. For example, if the term “boots” appears more often in documents related to the First Footwear Company compared to the documents related to the Second Footwear Company, the First Footwear Company is ranked higher than the Second Footwear Company for the keyword “boots.” While the illustrated implementations of the operations 100 disclose the operation 114 for boosting the TF-IDF value and the operation 116 for determining keyword relevance, in an alternative implementation, these operations may be combined.
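The two ranked lists produced by the operation 116 (keywords per company and companies per keyword) can be illustrated with a pair of simple indexes. The function name `build_indexes`, the `top_k` cutoff, and the scores below are assumptions made for this sketch.

```python
from collections import defaultdict

def build_indexes(relevance, top_k=3):
    """Build forward and reverse lookups from {company: {keyword: score}}.

    Returns (tags_by_company, companies_by_tag); both hold (name, score)
    pairs sorted by descending score. Only each company's top_k keywords
    are kept as its business tags.
    """
    tags_by_company = {}
    companies_by_tag = defaultdict(list)
    for company, scores in relevance.items():
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        tags_by_company[company] = top
        for tag, score in top:
            companies_by_tag[tag].append((company, score))
    for tag in companies_by_tag:
        companies_by_tag[tag].sort(key=lambda cs: cs[1], reverse=True)
    return tags_by_company, dict(companies_by_tag)

# Illustrative relevance scores only.
relevance = {
    "First Footwear Company": {"boots": 9.0, "sandals": 4.0, "laces": 1.0},
    "Second Footwear Company": {"boots": 3.0, "sandals": 7.5},
}
tags_by_company, companies_by_tag = build_indexes(relevance, top_k=2)
```

Looking up `companies_by_tag["boots"]` ranks the First Footwear Company ahead of the Second Footwear Company, while `tags_by_company` answers the opposite question for a given company.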

Clustering and Cluster Tagging

The TF-IDF and Relevance based semantic representations of companies can be used to directly drive product applications as well as implicitly support downstream machine learning applications. In one machine learning application, Representation Learning techniques are applied by an operation 118 on the TF-IDF or relevance vectors to generalize or project companies in the high dimensional n-gram space into a lower dimensional topic space. Such techniques include using Singular Value Decomposition, Latent Dirichlet Allocation, Hierarchical Dirichlet Processes, Non-negative Matrix Factorization, Neural Network Autoencoders, and others. Companies that are close together in the topic space, e.g. according to Euclidean or Cosine distance, are effectively similar to each other in terms of their business, product offerings, solutions or services.
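As a small illustration of the cosine measure mentioned above, the sketch below compares companies directly on sparse n-gram vectors; it does not perform the projection into a topic space, and the vectors and function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(target, vectors, top_k=2):
    """Rank all other companies by cosine similarity to the target company."""
    others = [
        (name, cosine_similarity(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:top_k]

# Illustrative vectors only.
vectors = {
    "Shoe Company A": {"running shoe": 3.0, "sandals": 1.0},
    "Shoe Company B": {"running shoe": 2.0, "boots": 2.0},
    "Game Studio C": {"mobile games": 4.0},
}
neighbors = most_similar("Shoe Company A", vectors)
```

The same functions apply unchanged to lower dimensional topic vectors, provided they are stored as `{topic: weight}` mappings.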

Given that similar companies are close together in the topic vector space, a clustering algorithm is also applied at an operation 120 to automatically segment a broad set of companies into subsets or groups of companies that are similar to each other. Such clustering techniques include, but are not limited to, K-Means, Spectral Clustering, DBSCAN, OPTICS, Hierarchical Clustering, and Affinity Propagation.
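Of the clustering techniques listed, K-Means is the simplest to sketch. The version below is a bare-bones illustration on dense topic vectors; the deterministic "first k points" initialization and the fixed iteration count are simplifying assumptions, and a practical system would likely rely on an established library implementation.

```python
def kmeans(points, k, iterations=20):
    """Plain K-Means on a list of equal-length numeric tuples.

    Centroids are initialized from the first k points (an assumption made
    for reproducibility). Returns (labels, centroids).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Five companies in a hypothetical 2-dimensional topic space.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3)]
labels, centroids = kmeans(points, k=2)
```

The resulting `labels` split the five companies into two segments, one near the origin and one near (5, 5).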

A technique disclosed herein also allows automatic extraction of relevant n-gram keywords to describe each cluster of companies. For a cluster or any set of companies, the constituent companies' n-gram vector representations are merged into one n-gram vector via an aggregating function, a simple example of which is just the vector sum. From this merged n-gram vector, the relevance scoring algorithm described earlier is applied to boost the strengths of relevant n-grams, following the same principle that n-grams that are on-topic within the cluster should be considered more relevant. The top n-grams by relevance can be used to tag each cluster so that the clusters are readily human understandable. FIG. 7 below illustrates a detailed view of the clusters with various business tags 700, such as “Application Development”, “Mobile Products”, etc.
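A minimal sketch of cluster tagging with the vector-sum aggregate follows. The `relevance_fn` parameter is a hypothetical hook for whatever relevance scoring is in use; when it is omitted, the merged strengths are ranked as they are. The example vectors are invented.

```python
from collections import Counter

def merge_vectors(vectors):
    """Merge n-gram vectors ({ngram: strength}) by element-wise sum."""
    merged = Counter()
    for vec in vectors:
        merged.update(vec)  # Counter.update adds values for matching keys
    return dict(merged)

def tag_cluster(vectors, relevance_fn=None, top_n=2):
    """Return the top_n n-grams of a cluster, ranked by (boosted) strength."""
    merged = merge_vectors(vectors)
    scores = relevance_fn(merged) if relevance_fn else merged
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Three companies assigned to the same cluster (illustrative values).
cluster = [
    {"application development": 2.0, "mobile products": 1.0},
    {"application development": 1.5, "cloud hosting": 0.5},
    {"mobile products": 2.0},
]
tags = tag_cluster(cluster)
```

With these inputs the cluster is tagged “application development” and “mobile products”, since those n-grams carry the largest merged strengths.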

Visualization

Starting again from the notion that similar companies are close together in our semantic vector space, there is a lot of potential value in being able to visualize the clusters or segments of similar companies within a broad set of companies. The key requirement of the visualization technique is to be able to position entities that are close together in high dimensional space such that they are also close together in 2- or 3-dimensional space in order to preserve and visualize the similarity structure in an intuitive way. Some example techniques (sometimes referred to as manifold learning) that satisfy this requirement are t-Distributed Stochastic Neighbor Embedding (t-SNE) and Multi-Dimensional Scaling (MDS).

An operation 122 provides cluster visualization with business tags, such as the one illustrated below in FIGS. 6 and 7. The outputs of the visualization technique are 2- or 3-dimensional position coordinates of each entity to be visualized. However, these positions are computed without consideration of the sizes of the points to be visualized, i.e. only the point centers are considered. This may lead to, for example, overlapped circular points in the final visualization as node sizes are subsequently applied to try to convey additional information about each entity. This is not ideal from a user experience or aesthetic perspective, and can also obfuscate information if some points become hidden behind others.

The disclosed technology provides a technique to address this issue, by post-processing the positions according to a set of desired node sizes for the entities. In one implementation, the non-overlap problem formulation for n points may be given by:

$\text{minimize} \quad \sum_{i = 1}^{n} \left\| x_{i} - p_{i} \right\|_{2}$

$\text{subject to} \quad \left\| x_{i} - x_{j} \right\|_{2} \geq r_{i} + r_{j} \quad \text{for } i > j$

where x_(i) is the final layout position vector for point i to be optimized, p_(i) is the original position vector of point i, and r_(i) is the desired radius of point i in the final visualization. Conceptually, the constraints are to ensure that no two circular points are overlapped, while the system tries to minimize the total movement of points away from their original positions.

The problem with the above formulation is that the constraints are not convex, and thus the problem is not efficiently solvable. Therefore, a convex restriction is applied by modifying the constraints to ensure that any two points, in two dimensions for example, must be separated by a region defined by two parallel lines, both perpendicular to a directional constraint unit vector pointing in the direction from the original positions of point j to point i, whose width is at least r_(i)+r_(j). This results in a smaller feasible set, leading to slightly suboptimal solutions to the above problem formulation, but the optimization problem becomes convex and can be efficiently solved as a Quadratic Program using, e.g. interior point methods or other standard convex solvers. To get closer to the optimal solution of the original problem, multiple iterations of this convex optimization are run by using the solutions x_(i) of the previous run to set the directional constraint unit vector for the next run.
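One way to write the restricted constraint described above, introducing the symbol u_(ij) for the directional constraint unit vector (this symbol is a notational assumption and does not appear in the original formulation), is:

$u_{ij} = \frac{p_{i} - p_{j}}{\left\| p_{i} - p_{j} \right\|_{2}}, \qquad u_{ij}^{T} \left( x_{i} - x_{j} \right) \geq r_{i} + r_{j} \quad \text{for } i > j$

Each such constraint is linear in x_(i) and x_(j), and hence convex. Because u_(ij) has unit length, the Cauchy-Schwarz inequality gives

$\left\| x_{i} - x_{j} \right\|_{2} \geq u_{ij}^{T} \left( x_{i} - x_{j} \right) \geq r_{i} + r_{j}$

so any layout that satisfies the restricted constraints also satisfies the original non-overlap constraints. On later iterations, u_(ij) would be recomputed from the previous solutions x_(i) and x_(j) in place of p_(i) and p_(j).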

To further optimize the computational efficiency, a large number of the constraints can be removed without much consequence, because points that are originally far apart from each other most likely will not violate the non-overlap constraint even after applying node sizes. To this end, an implementation considers constraints only between each point and its k nearest neighbors where k is much smaller than the number of points.
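The constraint pruning can be sketched as below. The brute-force neighbor search and the function name `knn_constraint_pairs` are assumptions for illustration; a large layout would more plausibly use a spatial index.

```python
import math

def knn_constraint_pairs(positions, k):
    """Return the (i, j) pairs, with i > j, that keep a non-overlap constraint.

    A pair is kept when either point is among the other's k nearest
    neighbors in the original layout.
    """
    pairs = set()
    for i, p in enumerate(positions):
        by_distance = sorted(
            (math.dist(p, q), j) for j, q in enumerate(positions) if j != i
        )
        for _, j in by_distance[:k]:
            pairs.add((max(i, j), min(i, j)))
    return sorted(pairs)

# Two well-separated groups of two points each (illustrative layout).
positions = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
pairs = knn_constraint_pairs(positions, k=1)
```

For these four points only two of the six possible pairwise constraints remain, one inside each group.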

FIG. 2 illustrates various operations 200 for application of the described technology in visualization of the clusters. Specifically, operations 202 to 216 are substantially similar to the operations 102-112. An operation 220 allows searching companies based on relevant business tags or keywords.

The operations 220 and 222 together provide the ability to look up companies by keywords and keywords by company. Thus, a user may input a keyword, such as “shoes”, in a user interface and get a list of companies that are related to shoes. In one implementation, the list of companies is ranked by relevance to the keyword “shoes”. Another operation 222 allows looking up business tags based on companies. For example, a user may input a company name, such as “Shoe Company A”, in a user interface and get a list of keywords that are related to Shoe Company A. In one implementation, the list of keywords is ranked by relevance to “Shoe Company A.”

An example illustration of what the non-overlap algorithm 500 accomplishes is shown in FIG. 5. The original positions are labeled by dots at the flat end of the arrows, and the final positions are indicated by the pointed end of the arrows. Box constraints x_(min)≤x_(i)≤x_(max) were added in this example to also bound the circular points within a rectangular region.

In one product application, the customers or prospects of a client are analyzed by clustering them based on our semantic representations, either in the keyword space or the more general topic space. These clusters each consist of companies that are similar in business offerings to each other, and human understandable business tags can be extracted using our cluster tagging and relevance scoring technique. In effect, the clusters can be considered to be micro-segments on which marketers can craft specialized messages and content which resonate well with the personas of the companies in each of the micro-segments, leading to improved conversion rates.

In a related product application, a visualization of the clusters can be shown along with the business tags describing each cluster, to use as an intuitive user interface for clients to get an overview of their customers or prospects, see FIGS. 6 and 7 below. For example, FIG. 6 illustrates a cluster of companies 600 related to online shop 602, a cluster of companies related to capital management 604, etc. Additional valuable insights are provided via different marker sizes, colors, or transparency values of the plot markers. In one example, the plot markers are sized according to contract size, so that our client can quickly see which micro-segments are most valuable to target and pursue. FIGS. 6 and 7 illustrate example visualizations exhibiting the non-overlap layout, clustering, and cluster tagging. In another example, the visualization is animated according to when each opportunity or deal came to exist, allowing our client to find emerging customer segments or see how existing micro-segments evolved over time. Finally, the visualization can also be used as a user interface for our clients to select and create audience segments within our platform, e.g. using a lasso-like tool.

Another product application provides a search engine for companies based on the most relevant business tags that are extracted. This allows marketers and sales teams to quickly search through tens of millions of businesses for specific target segments or companies that may have similar needs. For example, assume we have a client who is a hard drive manufacturer and they have several exemplary customers that specialize in video surveillance. Video surveillance companies typically have a need for large amounts of hard drive storage to archive large video files. With the search engine, the client can easily find new “video surveillance” businesses that were previously unknown to them, and reach out in a highly personalized way with relevant and successful case studies of their exemplary customers.

In a synergistic application on this platform, the search engine results are ranked by their fit scores according to a client's own trained customer model. In one implementation, the keyword search is tailored for searching businesses: because the keywords are extracted from product pages and business descriptions using a specialized relevance algorithm, the search yields much more accurate results. By coupling with fit scores, the search engine results are both accurate with respect to the keyword query and simultaneously relevant to the client's specific business.

FIG. 8 illustrates an example system 800 that may be useful in implementing the described predictive analytics technology. The example hardware and operating environment of FIG. 8 for implementing the described technology includes a computing device, such as a general-purpose computing device in the form of a gaming console or computer 20, a mobile telephone, a personal data assistant (PDA), a set top box, or other type of computing device. In the implementation of FIG. 8, for example, the computer 20 includes a processing unit 21, a system memory 22, and a system bus 23 that operatively couples various system components including the system memory to the processing unit 21. There may be only one or there may be more than one processing unit 21, such that the processor of computer 20 comprises a single central-processing unit (CPU), or a plurality of processing units, commonly referred to as a parallel processing environment. The computer 20 may be a conventional computer, a distributed computer, or any other type of computer; the implementations are not so limited.

The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, a switched fabric, point-to-point connections, and a local bus using any of a variety of bus architectures. The system memory may also be referred to as simply the memory, and includes read only memory (ROM) 24 and random-access memory (RAM) 25. A basic input/output system (BIOS) 26, containing the basic routines that help to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24. The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM, DVD, or other optical media.

The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical disk drive interface 34, respectively. The drives and their associated tangible computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer 20. It should be appreciated by those skilled in the art that any type of tangible computer-readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROMs), and the like, may be used in the example operating environment.

A number of program modules may be stored on the hard disk, removable magnetic disk 29, removable optical disk 31, ROM 24, or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices, such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone (e.g., for voice input), a camera (e.g., for a natural user interface (NUI)), a joystick, a game pad, a satellite dish, a scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, computers typically include other peripheral output devices (not shown), such as speakers and printers.

The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 49. These logical connections are achieved by a communication device coupled to or a part of the computer 20; the implementations are not limited to a particular type of communications device. The remote computer 49 may be another computer, a server, a router, a network PC, a client, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in FIG. 8. The logical connections depicted in FIG. 8 include a local-area network (LAN) 51 and a wide-area network (WAN) 52. Such networking environments are commonplace in office networks, enterprise-wide computer networks, intranets and the Internet, which are all types of networks.

When used in a LAN-networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53, which is one type of communications device. When used in a WAN-networking environment, the computer 20 typically includes a modem 54, a network adapter, a type of communications device, or any other type of communications device for establishing communications over the wide area network 52. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program engines depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It is appreciated that the network connections shown are examples and that other means of, and communications devices for, establishing a communications link between the computers may be used.

In an example implementation, software or firmware instructions and data for providing a search management system, various applications, search context pipelines, search services, a local file index, a local or remote application content index, a provider API, a contextual application launcher, and other instructions and data may be stored in memory 22 and/or storage devices 29 or 31 and processed by the processing unit 21.

Some embodiments may comprise an article of manufacture. An article of manufacture may comprise a tangible storage medium to store logic. Examples of a storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. In one embodiment, for example, an article of manufacture may store executable computer program instructions that, when executed by a computer, cause the computer to perform methods and/or operations in accordance with the described embodiments. The executable computer program instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The executable computer program instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a computer to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.

FIG. 9 illustrates diagrams 900 describing the determination of similarity between keywords. To compute how similar any two keyword phrases are to each other, the system disclosed herein generates a vector representation for each keyword phrase, from which it then computes various distance or similarity metrics between a pair of vectors. Example metrics include, but are not limited to, cosine distance or Euclidean distance. This is loosely similar to the approach used herein to compute similarity between companies, in which companies are mapped into vectors and then similarities are computed using a distance metric in the vector space.
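By way of a non-limiting illustration, the two example metrics named above may be sketched as follows. The function names and the convention of returning a cosine distance of 1.0 for an all-zero vector are assumptions made for this sketch and are not specified in the disclosure.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Return the straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine distance ignores vector length and compares direction only, whereas Euclidean distance is also sensitive to magnitude; which is preferable depends on how the key-phrase vectors are scaled.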

One implementation disclosed herein uses factorization, with dimensionality reduction, of the positive point-wise mutual information (PPMI) matrix of keyword phrase to “context” word co-occurrences. Context words are words that appear around the keyword phrases in natural language sentences, documents, or conversations. The reasoning is that two keyword phrases are similar if they have similar word contexts. To further capture and distinguish between longer distance versus shorter distance contextual semantics, the context words can be segregated by zones or regions of distance away from the central keyword phrase. An example with three zones is illustrated by 902.

Subsequently, the system disclosed herein parameterizes the context zones by the window size, which defines how many word positions fall into the zone, and by the zone offset, which defines how many positions to shift the zone away from the central keyword phrase. In the example illustrated by 902, symmetric zones to the left and right of the keyword phrase are treated together, but in other implementations, zones to the left versus to the right of the keyword phrase may be tracked separately as well.

Subsequently, the system disclosed herein forms the co-occurrence matrix of keyword phrase to context words by counting the occurrences of each pair of (w, c), where w is the keyword phrase and c is a context word within a specific zone. In some implementations, the contribution of a co-occurring pair may be weighted by the position within the zone or distance from the central keyword phrase. The co-occurrence values are aggregated over a large corpus of natural language text documents, such as news articles, crawled websites, and Wikipedia articles. The aggregated values are stored in a keyword-context matrix, as illustrated at 904 for the example of three context zones.
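By way of a non-limiting illustration, the zone-based counting described above may be sketched as follows for a single tokenized document. The function name, the representation of each zone as an (offset, window size) pair, the unweighted counts, and the example sentence are all assumptions made for this sketch.

```python
from collections import Counter


def count_zone_cooccurrences(tokens, phrase, zones):
    """Count (keyword phrase, context word, zone index) triples.

    tokens: list of word tokens for one document.
    phrase: list of tokens forming the keyword phrase w.
    zones:  list of (offset, window_size) pairs; a zone covers the
            positions offset+1 .. offset+window_size away from the
            phrase, on both sides (symmetric zones treated together).
    """
    counts = Counter()
    n = len(phrase)
    key = " ".join(phrase)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] != phrase:
            continue
        end = start + n - 1  # index of the last token of the phrase
        for zone_id, (offset, window) in enumerate(zones):
            for d in range(offset + 1, offset + window + 1):
                for pos in (start - d, end + d):
                    if 0 <= pos < len(tokens):
                        counts[(key, tokens[pos], zone_id)] += 1
    return counts


# Hypothetical example: two zones of two positions each.
example_tokens = "we sell video surveillance cameras and storage".split()
example_counts = count_zone_cooccurrences(
    example_tokens, ["video", "surveillance"], zones=[(0, 2), (2, 2)]
)
```

Because the result is a `Counter`, per-document results can be aggregated over a corpus with `Counter.update`, which adds counts key by key. Position-dependent weighting, mentioned above as an option, could replace the `+= 1` with a weight that depends on `d`.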

The raw co-occurrence values are not a good measure of key-phrase to context word association because certain words and phrases naturally occur more frequently than others. Instead, an implementation disclosed herein uses point-wise mutual information to measure how informative a context word is about a target key-phrase. For each cell in the matrix, the system computes the point-wise mutual information as:

$\mathrm{pmi}(w,c) = \log\frac{p(w,c)}{p(w)\,p(c)}$

where p(w, c) is the probability of the co-occurring keyword phrase and context word, p(w) is the probability of observing the keyword phrase, and p(c) is the probability of observing the context word. Positive PMI values mean that the words co-occur more often than they would if they were independent, and larger values indicate a stronger association. In practice, negative values are unreliable when dealing with extremely small probabilities and require large amounts of text and evidence; therefore, in some implementations, only positive PMI values are considered, and negative values are replaced with 0 using:

ppmi(w,c)=max(0,pmi(w,c)).
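By way of a non-limiting illustration, the PMI and PPMI computations above may be sketched as follows. The sketch assumes that the probabilities are estimated as relative frequencies from the co-occurrence counts (row, column, and grand totals) and that the natural logarithm is used; the disclosure does not fix either choice, and the function name is hypothetical.

```python
import math


def ppmi_from_counts(counts):
    """Map each observed (w, c) pair to its PPMI value.

    counts: dict mapping (keyword phrase, context word) -> count.
    Pairs that were never observed are simply absent from the
    result, which is equivalent to a PPMI of 0.
    """
    total = sum(counts.values())
    w_totals, c_totals = {}, {}
    for (w, c), n in counts.items():
        w_totals[w] = w_totals.get(w, 0) + n
        c_totals[c] = c_totals.get(c, 0) + n

    ppmi = {}
    for (w, c), n in counts.items():
        if n <= 0:
            continue  # log of zero is undefined; treat as PPMI 0
        p_wc = n / total
        p_w = w_totals[w] / total
        p_c = c_totals[c] / total
        pmi = math.log(p_wc / (p_w * p_c))
        ppmi[(w, c)] = max(0.0, pmi)  # negative values become 0
    return ppmi
```

For example, with counts of 8 for ("ai", "learning"), 2 for ("ai", "the"), 8 for ("storage", "the"), and 2 for ("storage", "learning"), the pair ("ai", "learning") has p(w, c) = 0.4 against an independence baseline of 0.25, giving a positive PMI of log 1.6, while ("ai", "the") falls below the baseline and is set to 0.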

In some implementations, the p(c) term is also modified to give rare context words higher probabilities because very rare words can skew PMI to large values, resulting in worse performance in the downstream semantic similarity tasks. One example modification is:

$p'(c) = \frac{\mathrm{count}(c)^{\alpha}}{\sum_{c} \mathrm{count}(c)^{\alpha}}$

where the context counts are raised to a power α that is between 0 and 1, which has the effect of increasing the probability of rare context words. Another possible modification is add-k smoothing, which modifies each count(c) by the addition of a positive value k, thus raising the minimum count of rare words.
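By way of a non-limiting illustration, the two modifications of p(c) described above may be sketched as follows. The function names and the default values α = 0.75 and k = 1.0 are assumptions made for this sketch; the disclosure only requires α to lie between 0 and 1 and k to be positive.

```python
def smoothed_context_probs(context_counts, alpha=0.75):
    """Return p'(c) with each count raised to the power alpha."""
    scaled = {c: n ** alpha for c, n in context_counts.items()}
    normalizer = sum(scaled.values())
    return {c: s / normalizer for c, s in scaled.items()}


def add_k_context_probs(context_counts, k=1.0):
    """Return p(c) after adding a positive constant k to every count."""
    normalizer = sum(context_counts.values()) + k * len(context_counts)
    return {c: (n + k) / normalizer for c, n in context_counts.items()}
```

As a worked example, with counts of 81 for "the" and 1 for "lidar", the unmodified probability of "lidar" is 1/82 (about 0.012). With α = 0.5 the scaled counts are 9 and 1, so the probability of "lidar" rises to 0.1. A larger p(c) in the PMI denominator lowers the PMI of pairs involving the rare word, which counteracts the skew described above.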

Once the matrix of PPMI values is formed, it is factorized, e.g., using Singular Value Decomposition, into a key-phrase-to-latent topic matrix multiplying a latent topic-to-context matrix. The rows of the key-phrase-to-latent topic matrix are the desired key-phrase vectors from which the system computes similarities between every pair of keyword phrases.
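By way of a non-limiting illustration, the factorization may be sketched with a truncated Singular Value Decomposition using NumPy. The function name, the choice to keep the k largest singular values, and the choice to fold the singular values into the key-phrase side are assumptions made for this sketch; other weightings of the singular values are possible.

```python
import numpy as np


def keyphrase_vectors(ppmi, k):
    """Factorize a PPMI matrix and keep k latent topics.

    ppmi: 2-D array, rows = key-phrases, columns = context words.
    Returns (phrase_to_topic, topic_to_context); their product is
    the best rank-k approximation of ppmi in the least-squares sense.
    """
    u, s, vt = np.linalg.svd(ppmi, full_matrices=False)
    k = min(k, len(s))
    phrase_to_topic = u[:, :k] * s[:k]  # one row vector per key-phrase
    topic_to_context = vt[:k, :]
    return phrase_to_topic, topic_to_context
```

Each row of `phrase_to_topic` is a dense key-phrase vector of length k. Two key-phrases with identical PPMI rows receive identical vectors, and key-phrases with similar context profiles end up close together in the reduced space.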

The above paragraphs disclose only one technique to produce key-phrase vectors from which similarities may be computed. Other word embedding techniques include CBOW Word2Vec, Skip-gram Word2Vec, or GloVe, though they may be used on single words rather than keyword phrases.

Using the similarity measure between all key-phrases, an implementation disclosed herein produces a list of most similar key-phrases for every key-phrase. This additional dataset synergizes well with other offerings on the system disclosed herein, particularly enhancing the ability for users to search for companies using the key-phrases with which companies have been algorithmically tagged (described in other sections of this patent). Using the outputs of the keyword similarity computation, the system disclosed herein suggests related and similar keywords for the user to add to their query. For example, when a user searches for “artificial intelligence” companies, the system can automatically suggest additional queries on “machine learning”, “deep learning”, “computer vision”, and “ai”. This greatly reduces the burden on users to recall or think of all possible variants of similar key-phrase queries, and may even introduce new concepts or terms that the user was not aware of.
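By way of a non-limiting illustration, the list of most similar key-phrases may be produced as sketched below, here using cosine similarity over key-phrase vectors. The function name, the `top_n` parameter, and the brute-force comparison of every pair are assumptions made for this sketch; a large vocabulary would likely call for an approximate nearest-neighbor index instead.

```python
import math


def most_similar(vectors, top_n=3):
    """Map each key-phrase to its top_n most similar key-phrases.

    vectors: dict mapping key-phrase -> list of floats.
    Returns a dict mapping key-phrase -> list of
    (other key-phrase, cosine similarity), most similar first.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        if norm_u == 0.0 or norm_v == 0.0:
            return 0.0
        return dot / (norm_u * norm_v)

    neighbors = {}
    for phrase, vec in vectors.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in vectors.items()
            if other != phrase
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        neighbors[phrase] = scored[:top_n]
    return neighbors
```

At query time, the entries stored for the user's key-phrase can be offered directly as suggested additions to the query.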

Another application of the similarity measure between key-phrases is the enhancement of the algorithm described above for automatically tagging companies with keywords that describe the company's business. FIG. 4 above showed how topics that exist within a document are used to boost the relevance score of individual keywords that are related to that topic. Using the computed similarities between keywords, each keyword can boost another keyword's relevance score by considering how similar they are to each other. Specifically, a clique of related keywords may boost each other's relevance score because the fact that they co-occur within a document suggests that there is a related topic that the document is focused on.
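By way of a non-limiting illustration, one possible form of such mutual boosting is sketched below. The disclosure does not specify a boosting formula; the additive form, the `weight` parameter, and the function name are purely assumptions made for this sketch.

```python
def boost_relevance(base_scores, similarity, weight=0.5):
    """Boost each keyword's score using similar co-occurring keywords.

    base_scores: dict mapping keyword -> relevance score in one document.
    similarity:  callable (keyword_a, keyword_b) -> similarity value.
    weight:      assumed tuning factor controlling the boost strength.
    """
    boosted = {}
    for keyword, score in base_scores.items():
        # Support contributed by every other keyword in the same
        # document, scaled by how similar it is to this keyword.
        support = sum(
            similarity(keyword, other) * other_score
            for other, other_score in base_scores.items()
            if other != keyword
        )
        boosted[keyword] = score + weight * support
    return boosted
```

Under this sketch, a document containing both “machine learning” and “deep learning” raises the score of each, while an unrelated keyword in the same document keeps its original score.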

FIG. 10 illustrates a flowchart with operations 1000 for determining similarity of keywords. An operation 1002 generates a vector representation for each keyword phrase. One or more operations for generating the vector representation as per 1002 are illustrated by the block 1004 of operations. An operation 1006 parameterizes the context zones by window size, which defines how many word positions fall into the zone, and by the zone offset, which defines how many positions to shift the zone away from the central keyword phrase. Operations 1008 and 1010 count the context word occurrences in the desired context zones around keyword phrases.

Subsequently, an operation 1012 forms the co-occurrence matrix of keyword phrase to context words by counting the occurrences of each pair of (w, c), where w is the keyword phrase and c is a context word within a specific zone. An operation 1014 aggregates co-occurrence values over a large corpus of natural language text documents, such as news articles, crawled websites, and Wikipedia articles. The aggregated values are stored in a keyword-context matrix at an operation 1016, as illustrated at 904.

An operation 1018 modifies the p(c) term to give rare context words higher probabilities because very rare words can skew PMI to large values, resulting in worse performance in the downstream semantic similarity tasks. Subsequently, an operation 1020 computes point-wise mutual information pmi(w, c). In some implementations, only the positive PMI values ppmi(w, c) are considered, and negative values are replaced with 0. Subsequently, an operation 1022 factorizes the matrix of PPMI values, using Singular Value Decomposition, into a key-phrase-to-latent topic matrix multiplying a latent topic-to-context matrix. An operation 1024 computes similarities between pairs of keyword phrases using the rows of the key-phrase-to-latent topic matrix.

The implementations described herein are implemented as logical steps in one or more computer systems. The logical operations may be implemented (1) as a sequence of processor-implemented steps executing in one or more computer systems and (2) as interconnected machine or circuit modules within one or more computer systems. The implementation is a matter of choice, dependent on the performance requirements of the computer system being utilized. Accordingly, the logical operations making up the implementations described herein are referred to variously as operations, steps, objects, or modules. Furthermore, it should be understood that logical operations may be performed in any order, unless explicitly claimed otherwise or a specific order is inherently necessitated by the claim language.

The above specification, examples, and data provide a complete description of the structure and use of exemplary implementations. Since many implementations can be made without departing from the spirit and scope of the claimed invention, the claims hereinafter appended define the invention. Furthermore, structural features of the different examples may be combined in yet another implementation without departing from the recited claims.

Embodiments of the present technology are disclosed herein in the context of an electronic market system. In the above description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. For example, while various features are ascribed to particular embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments, as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to the invention, as other embodiments of the invention may omit such features.

In the interest of clarity, not all of the routine functions of the implementations described herein are shown and described. It will, of course, be appreciated that in the development of any such actual implementation, numerous implementation-specific decisions must be made in order to achieve the developer's specific goals, such as compliance with application- and business-related constraints, and that those specific goals will vary from one implementation to another and from one developer to another.

According to one embodiment of the present invention, the components, process steps, and/or data structures disclosed herein may be implemented using various types of operating systems (OS), computing platforms, firmware, computer programs, computer languages, and/or general-purpose machines. The method can be run as a programmed process running on processing circuitry. The processing circuitry can take the form of numerous combinations of processors and operating systems, connections and networks, data stores, or a stand-alone device. The process can be implemented as instructions executed by such hardware, hardware alone, or any combination thereof. The software may be stored on a program storage device readable by a machine.

According to one embodiment of the present invention, the components, processes, and/or data structures may be implemented using machine language, assembler, C or C++, Java and/or other high level language programs running on a data processing computer such as a personal computer, workstation computer, mainframe computer, or high performance server running an OS such as Solaris® available from Sun Microsystems, Inc. of Santa Clara, Calif., Windows Vista™, Windows NT®, Windows XP PRO, and Windows® 2000, available from Microsoft Corporation of Redmond, Wash., Apple OS X-based systems, available from Apple Inc. of Cupertino, Calif., or various versions of the Unix operating system such as Linux available from a number of vendors. The method may also be implemented on a multiple-processor system, or in a computing environment including various peripherals such as input devices, output devices, displays, pointing devices, memories, storage devices, media interfaces for transferring data to and from the processor(s), and the like. In addition, such a computer system or computing environment may be networked locally, or over the Internet or other networks. Different implementations may be used and may include other types of operating systems, computing platforms, computer programs, firmware, computer languages and/or general-purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general-purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein.

In the context of the present invention, the term “processor” describes a physical computer (either stand-alone or distributed) or a virtual machine (either stand-alone or distributed) that processes or transforms data. The processor may be implemented in hardware, software, firmware, or a combination thereof.

In the context of the present technology, the term “data store” describes a hardware and/or software means or apparatus, either local or distributed, for storing digital or analog information or data. The term “data store” describes, by way of example, any such devices as random access memory (RAM), read-only memory (ROM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), Flash memory, hard drives, disk drives, floppy drives, tape drives, CD drives, DVD drives, magnetic tape devices (audio, visual, analog, digital, or a combination thereof), optical storage devices, electrically erasable programmable read-only memory (EEPROM), solid state memory devices and Universal Serial Bus (USB) storage devices, and the like. The term “data store” also describes, by way of example, databases, file systems, record systems, object oriented databases, relational databases, SQL databases, audit trails and logs, program memory, cache and buffers, and the like.

The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments of the invention. Although various embodiments of the invention have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the spirit or scope of this invention. In particular, it should be understood that the described technology may be employed independent of a personal computer. Other embodiments are therefore contemplated. It is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative only of particular embodiments and not limiting. Changes in detail or structure may be made without departing from the basic elements of the invention as defined in the following claims.

What is claimed is:
 1. A computer-implemented method, wherein one or more computing devices comprising storage and a processor are programmed to perform steps comprising: generating a count for keyword phrases and topics extracted from a corpus of documents, the topics being associated with the extracted keyword phrases or a portion of the extracted keyword phrases; determining document frequencies (DF) for each extracted keyword phrase across the corpus of documents; applying a term-frequency (TF)-inverse-document-frequency (IDF) (TF-IDF) transformation to each of the extracted keyword phrases to generate a respective plurality of TF-IDF vectors; determining a strength of each topic based on a number of extracted keyword phrases associated with that respective topic; determining an edge weight based on a linkage of the topic with an associated extracted keyword phrase; generating relevance scores relating each extracted keyword phrase to the respective company based on a strength of each of the extracted keyword phrases, the strength of each topic, and the edge weight for each topic, the strength of each of the extracted keyword phrases being equal to a TF-IDF vector associated with the extracted keyword phrase; applying a representation learning technique to the plurality of TF-IDF vectors and the relevance scores to generalize each respective company into at least one of a plurality of topic spaces; segmenting the plurality of companies into clusters by applying a clustering technique to the extracted keyword phrases for each respective company or to the plurality of topic spaces; and outputting the clusters of companies with respective business tags.
 2. The method of claim 1, further comprising generating similarity between two of the extracted keyword phrases based on a distance metric between the two extracted keyword phrases and determining the strength of each topic based on the number of extracted keyword phrases and the similarity between the two extracted keyword phrases.
 3. The method of claim 2, wherein the distance metric includes a distance between the two extracted keyword phrases as either cosine distance or Euclidean distance.
 4. The method of claim 1, further comprising generating similarity between two of the extracted keyword phrases based on a positive point-wise mutual information (PPMI) matrix of the two extracted keyword phrases to context words and determining the strength of each topic based on the number of extracted keyword phrases and the similarity between the two extracted keyword phrases.
 5. The method of claim 4, further comprising segregating the context words by regions of distances away from a central keyword phrase.
 6. The method of claim 4, further comprising generating a co-occurrence matrix of the two extracted keyword phrases to context words by counting the occurrences of each pair of (w, c), wherein w is the extracted keyword phrase and c is a context word within a specific zone.
 7. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second, overlapping cluster.
 8. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second, non-overlapping cluster.
 9. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second cluster that is larger than the first cluster.
 10. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second cluster that is approximately the same size as the first cluster.
 11. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second cluster, and extracting keywords for each of the first cluster and the second cluster.
 12. The method of claim 11, further comprising generating the relevance scores relating to the first cluster and the second cluster based on a strength of each of the extracted keywords for the first cluster and the second cluster, respectively, and outputting the relevance scores relating to the first cluster and the second cluster.
 13. A system, comprising: a processor configured to: generate a count for keyword phrases and topics extracted from a corpus of documents, the topics being associated with the extracted keyword phrases or a portion of the extracted keyword phrases; determine document frequencies (DF) for each extracted keyword phrase across the corpus of documents; apply a term-frequency (TF)-inverse-document-frequency (IDF) (TF-IDF) transformation to each of the extracted keyword phrases to generate a respective plurality of TF-IDF vectors; determine a strength of each topic based on a number of extracted keyword phrases associated with that respective topic; determine an edge weight based on a linkage of the topic with an associated extracted keyword phrase; generate relevance scores relating each extracted keyword phrase to the respective company based on a strength of each of the extracted keyword phrases, the strength of each topic, and the edge weight for each topic, the strength of each of the extracted keyword phrases being equal to a TF-IDF vector associated with the extracted keyword phrase; apply a representation learning technique to the plurality of TF-IDF vectors and the relevance scores to generalize each respective company into at least one of a plurality of topic spaces; create segments of the plurality of companies into clusters by applying a clustering technique to the extracted keyword phrases for each respective company or to the plurality of topic spaces; and an output configured to transmit the clusters of companies with respective business tags to another computing device, network, or system.
 14. The system of claim 13, wherein the processor is further configured to generate similarity between two of the extracted keyword phrases based on a distance metric between the two extracted keyword phrases and determine the strength of each topic based on the number of extracted keyword phrases and the similarity between the two extracted keyword phrases.
 15. The system of claim 13, wherein the processor is further configured to generate similarity between two of the extracted keyword phrases based on a positive point-wise mutual information (PPMI) matrix of the two extracted keyword phrases to context words and determine the strength of each topic based on the number of extracted keyword phrases and the similarity between the two extracted keyword phrases.
 16. The system of claim 15, wherein the processor is further configured to segregate the context words by regions of distances away from a central keyword phrase.
 17. The system of claim 15, wherein the processor is further configured to generate a co-occurrence matrix of the two extracted keyword phrases to context words by counting the occurrences of each pair of (w, c), wherein w is the extracted keyword phrase and c is a context word within a specific zone.
 18. The system of claim 13, wherein the processor is further configured to create segments of the plurality of companies into a first cluster and a second, overlapping cluster or a first cluster and a second, non-overlapping cluster.
 19. The system of claim 13, wherein the processor is further configured to create segments of the plurality of companies into a first cluster and a second cluster that is larger than the first cluster or approximately the same size as the first cluster.
 20. The system of claim 13, wherein: the processor is further configured to: create segments of the plurality of companies into a first cluster and a second cluster, extract keywords for each of the first cluster and the second cluster, and generate the relevance scores relating to the first cluster and the second cluster based on a strength of each of the extracted keywords for the first cluster and the second cluster, respectively, and the output is further configured to output the relevance scores relating to the first cluster and the second cluster.